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Abstract — Sparse Bayesian learning (SBL) is a popular ap- 
proach to sparse signal recovery in compressed sensing (CS). In 
SBL, the signal sparsity information is exploited by assuming 
a sparsity-inducing prior for the signal that is then estimated 
using Bayesian inference. In this paper, a new sparsity-inducing 
prior is introduced and efficient algorithms are developed for 
signal recovery. The main algorithm is shown to produce a 
sparser solution than existing SBL methods while preserving 
their desirable properties. Numerical simulations with one- 
dimensional synthetic signals and two-dimensional images verify 
our analysis and show that for sparse signals the proposed 
algorithm outperforms its SBL peers in both the signal recovery 
accuracy and computational speed. Its improved performance 
is also demonstrated in comparison with other state-of-the-art 
methods in CS. 

Index Terms — Compressed sensing, G-STG prior, greedy algo- 
rithm, sparse Bayesian learning. 



I. Introduction 

In practice, one would like to determine a high-dimensional 
signal x £ R N from its relatively low-dimensional linear 
measurements y = Ax E R M , where A e M MxAr and 
M < N. There exist infinite candidates of x that satisfy 
the linear equation in general. To possibly recover the true 
signal, some additional information needs to be taken into 
account. Fortunately, most signals of interest are sparse or 
compressible under appropriate bases, e.g., an image under a 
wavelet basis. Without loss of generality, we consider a signal 
that is sparse under the canonical basis since any sparsifying 
transform of x can be absorbed into the matrix A. So we 
need to search for the maximally sparse solution to the linear 
equation, i.e., to minimize ||a;|| that counts the number of 
nonzero entries of x subject to the constraint Ax = y. 
Since this combinatorial optimization problem is NP-hard, 
its convex relaxation (replacing ||cc|| by ||o;|| x ) coined as 
basis pursuit (BP) has been extensively studied (see 1 1 1 and 
references therein). During the past several years, the sparse 
signal recovery problem has been developed into a well-known 
research area named as compressed sensing (CS) G), which 
has wide applications including medical imaging BJ, source 
localization (4| and single-pixel camera (5), to name just a 
few. In CS, one wishes to recover a sparse signal x from 
its compressive measurements y = Ax. The sensing matrix 
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A is typically generated from a random distribution such 
that it satisfies some desirable properties such as restricted 
isometry property (RIP) p). It is shown that lx optimization 
can recover the true signal exactly under mild conditions. In 
addition, i\ optimization is robust to measurement noises and 
works efficiently with compressible signals [6|. While the l\ 
norm used to promote sparsity is a convex approximation to 
the original £q norm, nonconvex optimizations have also been 
studied. It is shown in (7), (SJ that improved performance 
can be obtained using the i p (0 < p < 1) norm. The 
nonconvex objective function J^ili 1°S (\ x i \ + r ) w i m T — 
is related to reweighted £\ minimization in [9|. Another class 
of approaches to CS uses a greedy pursuit method including 
OMP 1 10] and StOMP (TTJ. In a greedy algorithm, the support 
of the solution is modified sequentially with local optimization 
in each step. 

This paper is focused on Bayesian approaches to CS, known 
as Bayesian CS p2) . This method was originated from the 
area of machine leaning and introduced by Tipping [ 1 3 1 for 
obtaining sparse solutions to regression and classification tasks 
that use models which are linear in the parameters, coined as 
relevance vector machine (RVM) or sparse Bayesian learning 
(SBL). SBL is built upon a statistical perspective where the 
sparsity information is exploited by assuming a sparsity- 
inducing prior for the signal of interest that is then estimated 
via Bayesian inference. Its theoretical performance is analyzed 
by Wipf and Rao p4) . After being introduced into CS by Ji et 
al. fT2) , this technique has become a popular approach to CS 
and other sparsity-related problems. For example, a Bayesian 
framework is presented in JT5) for MEG/EEG source imaging. 
In fl6) , the authors introduce a framework to unify the CS 
problems with multi- and one-bit quantized measurements and 
estimate the sparse signals of interest and quantization errors 
based on a Bayesian formulation. Currently, main research 
topics in SBL for CS include (a) developing efficient sparsity- 
inducing priors, (b) incorporating additional signal structures 
in the prior besides sparsity and (c) designing fast and accurate 
inference algorithms. This paper studies problems (a) and (c). 
Examples of (b) include [ 17 1 that exploits the wavelet structure 
of images and fT8) on cluster structured sparse signals. 

Many sparsity-inducing priors have been studied in the 
literature. In fT7) , fl8) , a spike-and-slab prior |19] is applied 
which is a mixture of a point mass at zero and a continuous 
distribution elsewhere and fits naturally for sparse signals. A 
typical inference scheme for such a prior is a Markov chain 
Monte Carlo (MCMC) method (20) due to the lack of closed- 
form expressions of Bayesian estimators. As a result, the 
inference process may suffer from computational difficulties 
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because a large number of samples are required to approximate 
the posterior distribution and the convergence is typically slow. 
A popular class of sparsity-inducing priors is introduced in a 
hierarchical framework where a complex prior is composed 
of two or more simple distributions. For example, a Student's 
t-prior (or Gaussian-inverse gamma prior) is used in the basic 
SBL [13] that is composed of a Gaussian prior in the first 
layer and a gamma prior in the second. A Laplace (Gaussian- 
exponential) prior is used in ETJ. A Gaussian-gamma prior 
is recently studied in |22) that generalizes the Laplace prior. 
Two popular inference methods for the hierarchical priors are 
evidence procedure J23) , e.g., in fT3) , pT) , and variational 
Bayesian inference [24|, e.g., in [25]. Both the methods are 
approximations of Bayesian inference since the exact inference 
is intractable. In an evidence procedure, the signal estimator 
has a simple expression in which some unknown hyperpa- 
rameters are involved and estimated iteratively by maximizing 
their evidence. In a variational Bayesian inference method, the 
posterior distribution is approximated using some family of 
tractable distributions followed by computation of an optimal 
distribution within the family. To circumvent high-dimensional 
matrix inversions, a fast algorithm framework is developed in 
| [26| for evidence procedure and also adopted in |21 ]. But it is 
observed in this paper and p2[ that the fast algorithms in [21 1, 
p6| typically produce signal estimates with overestimated 
support sizes, especially in a low signal to noise ratio (SNR) 
regime or in the case of a large sample size. A new sparsity- 
inducing prior and algorithm are proposed in this work to 
resolve this problem with improved convergence speed. 

Though formulated from a different perspective, SBL is 
related to other approaches to CS. Consider the observation 
model y = Ax + e where e represents an additive white 
Gaussian noise (AWGN). Let p (x) be the prior for x. Then, 
a maximum a posteriori (MAP) estimator of x coincides 
with a solution to a regularized least-squares problem with 
— \ogp(x) (up to a scale) being the regularization term, 
which bridges SBL and optimization methods. For example, a 
Laplace prior corresponds to the widely studied l\ minimiza- 
tion. A prior corresponding to the nonconvex l v (0 < p < 1) 
norm is studied in [27|. The fast algorithm in [26] is related 
to the greedy pursuit method. In fact, it is a greedy algorithm 
using a different support modification criterion. Unlike OMP 
and StOMP, it allows deletion of irrelevant basis vectors that 
may have been added to the solution support in earlier steps. 

In this paper, we introduce a new sparsity-inducing prior 
named as Gaussian shifted-truncated-gamma (G-STG) prior 
that generalizes the Gaussian-gamma prior in (22). The ex- 
tended flexibility of the new prior promotes its capability 
of modeling sparse signals. In fact, it is shown that the 
Gaussian-gamma prior cannot work in the main algorithm 
of this paper. From the perspective of MAP estimation, the 
G-STG prior corresponds to a nonconvex objective function 
in optimization methods in general. For signal recovery we 
propose an iterative algorithm based on an evidence procedure 
and a fast greedy algorithm inspired by the algorithm in 1 26 1 . 
We show that similar theoretical guarantees shown in fl4[ hold 
for the new SBL method as for the basic SBL. Specifically, 
we show that every local optimum of the SBL cost function 



is achieved at a sparse solution and that the global optimum 
is achieved at the maximally sparse solution. Moreover, we 
show that the proposed algorithm produces a sparser solution 
than existing SBL methods. We provide simulation results 
with one-dimensional synthetic signals and two-dimensional 
images that verify our analysis. We compare our proposed 
method with state-of-the-art ones to illustrate its improved 
performance. 

Notations used in this paper are as follows. Bold-face letters 
are reserved for vectors and matrices. For ease of exposition, 
we do not distinguish a random variable from its numerical 
value. Xi is the ith entry of a vector x. ||cc|| counts the number 
of nonzero entries of x. ||a;|| = Q]\ |xi| p ) 1 ^ p for p > 
denotes the £ p norm (or pseudo-norm) of x. diag (x) denotes 
a diagonal matrix with diagonal entries being the elements of 
the vector x. A\j denotes the (i, j)th entry of a matrix A. \ A\ 
denotes the determinant of the matrix A. E {v} denotes the 
expectation of a random variable v. 

The rest of the paper is organized as follows. Section |H| 
introduces the G-STG prior. Section |HI| presents the Bayesian 
framework, an iterative procedure for signal recovery, and the- 
oretical results of the new SBL method. Section HVl introduces 
a fast greedy algorithm with analysis. Section [V] provides em- 
pirical results to show the efficiency of the proposed solution. 
Section |VI] concludes this paper. 

II. G-STG Prior 

A. Mathematical Formulation 

We introduce the hierarchical Gaussian shifted-truncated- 
gamma (G-STG) prior for a sparse signal x £ as follows: 

p(x\a) = Af(x\0,A) , (1) 

N N 

p(a;T,e,r)) = J|p (a,; r, e, 77) = Y[T T (0^6,77) , (2) 
i=i i=i 

where A = diag (a), a. E R , p (a^; r, e, 77) is a shifted- 
truncated-gamma (STG) distribution for cti > with 

T T (a>i\e, 77) = 77 (cti + r)'" 1 exp {-77 (at + r)} . (3) 

e > is the shape parameter, 77 > is the rate parameter, 
r > is the threshold parameter and T T (e) = J r t t ~ 1 e~ t dt 
denotes an incomplete gamma function. The first layer of 
the prior is a commonly used Gaussian prior that leads to 
convenient computations as shown later. In the second layer 
ctfj, i = 1, • • • , N, are assumed to be independent, and further 
ctfj + r, i = 1, • • • , N, are i.i.d. truncated gamma distribution 
(that is why we say that p (on; r, e, 77) is an STG distribution). 
By p (on) oc (cti + r) e_1 exp {— rjcti}, i = 1, • • • , N, it is 
obvious that cti is favored to be zero in the second layer if 
e < 1, resulting in that Xi is favored to be zero. Thus the 
hierarchical prior is a sparsity-inducing prior. In general, there 
is no explicit expression for the marginal distribution 

N 

p(x;T,e,r)) = ]Jp r, e, 77) 

Jv />oo 

= TT / Af (xi\0,cti)r T (cti\e,r))dcti. 
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In the following we study some special cases and show that 
the G-STG prior generalizes those in pT) , (22). 

1) e = 1: In this case, the second layer is reduced to 
an exponential prior independent of r since r r (ai|l,?/) = 
^g-r/ai Then, the G-STG prior coincides with the Laplace 
prior in pTj with p (x,; r, 1, ?/) = ^/r//2 exp (— y/2rj |£j|). 

2) t = 0: The second layer becomes a gamma 
prior for each at. As a result, the proposed prior be- 
comes the Gaussian-gamma prior in fl22| and p(xi) = 

2 3/4-e/2 2t + l 



v^r( £ 



V 



fc e - 1 (V^V where /C„ (•) is the 



modified Bessel function of the second kind and order v € HL 
In addition, we have that p (0) = +oo if e < | and 
p (0) < +oo if e > Though the G-STG prior generalizes 
the Gaussian-gamma prior, it should be noted that the main 
algorithm based on the G-STG prior proposed in this paper 
works differently from that in (22). 

3) t —> +oo: By l'Hospital's rule it can be shown that 
lim r ^ +00 T T (cti\e,ri) = r\e~ r,ai , i.e., the prior for on in the 
second layer approaches an exponential prior in such a case. 
Consequently, the proposed G-STG prior coincides with the 
Laplace prior as in Case 1. 

To visualize the variation of the G-STG prior with respect 
to the two parameters r and e, we plot the PDF p (a^; r, e, rj) 
in Fig. [T] with rj = 1. Fig. 1(a) is for the case of e = 0.1 and 
varying r. Obviously, the G-STG prior is a sparsity-inducing 
prior with the PDFs highly peaked at the origin, especially 
when t — y 0. The main difference between the cases r = 
and t = 1 x 10~ 8 is near the origin where p(xi) approaches 
infinity for r = while it is always finite for r > 0. As r 
gets larger, less density concentrates near the origin and the 
resulting prior gets close r to the Laplace prior that corresponds 
to t = +oo. Fig. 1(b) is for the case of r = 1 x 1CP 8 and 
varying e. It is shown that the G-STG prior gets less sparsity- 
inducing as e gets larger, and that it ceases to promote sparsity 
as e > 1. From the perspective of MAP estimation, the G- 
STG prior corresponds to a nonconvex optimization method 
as e < 1 since the term — logp(x) is nonconvex in such a 
case. 
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Fig. 1. 
and (b) 



PDFs of the G-STG prior in the case of (a) e = 0.1 and varying r 
t = 1 X 10 -8 and varying e with t] = 1. 



B. Intuitive Interpretation and Threshold Parameter Setting 

Strictly speaking, a continuous prior is not suitable for 
sparse signals since any vector generated from a continuous 
prior is only approximately sparse (the probability of a zero- 
valued entry is zero). In the following, we provide an intuitive 
explanation about why the G-STG prior works for sparse 
signals by setting the threshold parameter r according to the 
noise level. 

We consider a Gaussian ensemble sensing matrix A (the 
entries of A are i.i.d. Gaussian Af (0, Af" 1 ) where the vari- 
ance is set to M~ l to make columns of A have expected 
unit norm). This matrix ensemble has been widely studied, 
e.g., in (28), (29). Then, consider a compressible signal 
z and the observation model y — Az. In the Bayesian 
framework, we assume that z is distributed according to some 
sparsity-inducing prior. Here we adopt the Gaussian-gamma 
prior in [22|, i.e., we assume that p(z\/3) = Af(z\0,B) 
and p(P;e,r]) = ]lti r (AM) where B = dia S C 9 ) 



and r(/3 i |e, 77) refers to T T (f3 i \e,rj) with r = 0. However, 
theoretical results (6j, pO) state that only significant entries of 
z can be recovered while insignificant ones play as noises. So 
we write z into z = x + w where x denotes the significant, 
"recoverable" component and w refers to the insignificant, 
"unrecoverable" part. Then the observation model becomes 
y = A (x + w) = Ax + e with e = Aw. By the structure 
of A, e is a zero-mean AWGN with the noise variance 
a 1 — M^ 1 ||to|| 2 . We see that only the power of w is 
reflected in the noise. That is, for any vector w satisfying 
||iij|| 2 = || tu 1 1 2, Aw and e are identically distributed. So we 
may replace w by w in the observation model (i.e., x + w 
and z are indistinguishable for CS approaches). Then we may 
model w as an i.i.d. zero-mean Gaussian vector with variance 
t = N- 1 \\w\\ 2 = (M/N) a 2 . In addition, w is independent 
of x. So, under the assumption of a Gaussian ensemble matrix 
A, a compressible signal z is equivalent to the sum of its 
significant part x plus a white Gaussian noise w. Applying 
the Gaussian-gamma prior for z to x + w, we obtain that 
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fa — ai + t, i = 1, • • • , TV, where a is as defined in Q. 
Then we get p(c^) = p (/3j — r|/3j > r) as a conditional 
distribution, resulting in that p (a) is in the exact form of 

In this paper, we mainly consider the observation model 
y = Ax + e where a; is a sparse signal and e is a zero-mean 
AWGN with known variance a 2 . The same sparsity-inducing 
prior for x can be obtained by a reverse procedure and the 
details are omitted. So we can set the threshold parameter 
t = (M /N) a 2 in the G-STG prior. Though this setting is 
only based on intuition without rigorous analysis, it indeed 
leads to good performance as to be reported via simulations 
in Section [V] where it is also observed that this setting applies 
to other sensing matrix ensembles besides the Gaussian one. 

III. Sparse Bayesian Learning for Signal Recovery 

A. Bayesian Formulation 

In SBL, the signal of interest and noise are modeled as 
random variables. Under the common assumption of zero- 
mean AWGNs, i.e., e ~ J\f (0,a 2 l), where a 2 is the noise 
variance, the PDF of the compressive measurements is 



p(y\x;cr 2 ) =N (y\Ax,a 2 l) 



(5) 



In this paper we assume that the noise variance a 2 is known 
a priori. This assumption has been widely made in the CS 
literature, e.g., (6), pT[ . Moreover, it is shown in [32[ that to 
estimate a 2 jointly with the signal recovery process (e.g., in 
(13)) can lead to very inaccurate estimate. 

The G-STG prior introduced in Section [II] is adopted 
as the sparsity-inducing prior for the sparse signal x. The 
hyperparameters r and e are chosen manually according to 
the reasoning in Section [II] and Subsection |IV-B| Numer- 
ical simulations will be provided in SectionTvf to illus- 
trate their performance. To estimate 77 from the measure- 
ments, we assume a gamma hyperprior for 77: p (77; c, d) = 
T(rj\c,d), where we let c,d — > to obtain a uniform 
hyperprior (over a logarithmic scale). So the joint PDF 
of the observation model is p (y,x,a,ri;o- 2 ,T,e,c,d) = 
p (y\x; a -2 ) p{x\a)p (a.\rj; r, e) p (77; c, d), where y is the ob- 
servation, x is the unknown signal of interest, a and 77 are 
unknown parameters, and a 2 , r, e, c and d are fixed. The task 
is to estimate x. 



B. Bayesian Inference 

Note that the exact Bayesian inference is intractable since 
p(x\y) is computationally intractable. Some approximations 
have to be made. Following from fl3| , we decompose the 
posterior p (x, a, i]\y) into two terms as 



p(x,a,r/\y) = p (x\y, a) p (a, rj\y) . 



(6) 



The first term p(x\y, a) is the posterior for x given the 
hyperparameter a, which will be later shown to be a Gaussian 
PDF. Then, we compute the most probable estimates of a 
and 77, say (Xmp and 77MP, that maximize the second term 
p(ot,i]\y). We use q.mp to obtain the posterior for x. From 



the perspective of signal estimation, this is equivalent to 
requiring 

P{x\y) = J P (x, a, r]\y) dotd-qRip {x\y, a MP ) , (7) 

where p (x\y, olmp) refers to p(x\y,ot) at olmp- Similar 
procedures have been adopted in p4) , pT) . We will provide 
theoretical evidence to show that this approach leads to desir- 
able properties in Section |III-C| Simulation results presented 
in Section [V] also suggest that the signal recovery based on 
this approximation is very effective. 

Since p(y\x) and p(x\a) are both Gaussian, it is easy to 
show that the posterior for x and the marginal distribution 
for y are both Gaussian with p(x\y,ot) = Af (x\fi, S), 
p (y\a) = Af (y\0, C), where 



S 



a- 2 HA T yi 



a- 2 A 7 A 



a 2 1 + AAA 1 



(8) 
(9) 
(10) 



The maximization of p(at, r]\y) is equivalent to that 
of p(y,a,r}) = p (y\ot) p (a\rf) p (77) by the relation 
p(a, rj\y) — p(y,a,r/) /p(y). We consider log?7 as the 
hidden variable instead of 77 since the uniform hyperprior is 
assumed over a logarithmic scale. By p (log rj) — r\p (77) we 
see that the hyperprior for 77 leads to a noninformative prior 
by setting c = d = 0. So the log-likelihood function is 

C (a, log 77) 
= logp(y,a,log77) 



--log|C7| 



lv T C- x y 



N 



N 



(11) 



=1 



=1 



+ (Ne + c) log 77 - iVloglV (e) - drj + C u 
where C\ is a constant. The maximizer of £ will be analyzed 



in Subsection III-C In the following we provide an itera- 



tive procedure to maximize C by recognizing the identities 
log |C| = log |A| + A/logff 2 - log |S| and y T C- 1 y = 



, N, we have 

e - 1 



a~ 2 \\y - AaC + A* TA V- 

1) Update of a: For ai, i — 1, • • 

dC _ 1 E{x 2 } 



da; 



2a, 2a 2 
2aj (a, + r) ' 



V 



(12) 



where / (t) = 2rjt 3 + (3 - 2e + 2tjt) t 2 + (t - E {x 2 }) t - 
tE {a; 2 } for t e R is a cubic function and E {xfj = /j 2 + 
Y>a, where is the ith diagonal entry of S. We need the 
following lemma. 

Lemma 1: For a cubic function g if) = \it 3 + X 2 t 2 + X 3 t + 
A4, if Ai, A2 > and A4 < 0, then g (i) = has a unique 
root on (0, +00). 

Proof: By g(0) = A 4 < and lim t _j. +00 g (t) — +00 
there exists at least one root in (0, +00). We show that this 
root is unique using contradiction. Suppose there exists more 
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than one positive root. Then there must exist three positive 
roots and that g (i) has two positive stationary points. That is, 
the two solutions of = 3Aii 2 + 2X 2 t + A 3 = are both 
positive, resulting in that A2 < (contradiction). ■ 
By Lemma [T] it is easy to show that the maximum of C is 
achieved at the unique positive root, say a* > 0, of / (a*) = 0. 
We note that explicit expressions are available for the roots of 
a cubic function and hence on, i = 1, • • ■ , N, can be efficiently 
updated. 

2) Update of rj: In general, there is no explicit expression 
for updating rj. Since the first and second derivatives of C 
with respect to log 77 can be easily computed, C can be 
efficiently maximized with respect to log 77 using numerical 
methods, e.g., gradient ascending method or Newton's method. 
In addition, the computational complexity hardly depends on 
the CS problem dimension. 

Remark 1: The computation of the incomplete gamma 
function T VT (e) is involved in the update of 77. This term can 
be efficiently computed using functions provided in Matlab 
if e is properly bounded away from zero. But a numerical 
integration is needed if e = 0. In Section [VJ we observe 
through simulations that the update of 77 may take considerably 
long time in the case of e = 0. But such differences are 
negligible in the case of a high-dimensional CS problem 
since the computation of 77 hardly depends on the problem 
dimension unlike other computations, such as the update of 
a. 

As a result, an iterative algorithm can be implemented to 
obtain q.mp and 77MP by iteratively updating £ in (|9j, ji in 
dSJ, a and 77. It is easy to show that this algorithm can be 
implemented using an EM algorithm [33]. So the likelihood 
C increases monotonically at each iteration and the algorithm 
is guaranteed to converge. After convergence, the signal x 
is estimated using its posterior mean /j,. One shortcoming 
of the iterative algorithm is that at each iteration a high- 
dimensional matrix inversion has to be calculated for updating 
S though this computation can be possibly alleviated using the 
Woodbury matrix identity. 

C. Analysis of Global and Local Maxima 

We analyze the global and local maxima of the likelihood 
C in (JTTJ in this subsection. Our analysis is rooted in [14] and 
shows that the theoretical results on the basic SBL in [14] can 
be extended to our case with necessary modifications. In the 
following, we assume that c = d = and 77 > is fixed (it 
is a similar case if 77 is chosen to maximize L as well). Thus 
we may write C (with respect to a) as 



C{a) = - l -\og\C\- l -y J C 



y 



N 

+ (e - 1) J2 lo S ^ + T ) 

=1 



N 

i]^ai + C 4 , 

i=l 



(13) 



where C4 is a constant independent of a. We first consider 
the global maxima in the noise free case. 

Theorem 1: Let r > 0, < e < 1, e = and a 2 = 0. 
Assume that a; satisfying ||a; || < M < N is the maximally 
sparse solution to the linear equation y = Ax. Then, there 



exists some a with a < +00 and \\a° L = \\x° L such 

II 1 1 z II IIU II IIU 

that at or, C is globally maximized and the corresponding 
posterior mean is /j, = x°. 

1 / 1 \ t 

Proof: Note first that lim 

,t 



x° at a 



»of = A J [AA' 2 y = 



A 2 (AA 2 j Ax° by (81. Since || x° || Q < M there must exist 

|a; || , such that fj, ■■ 



\ a o < +°° ar, d \\ a 



ot a . In fact, a can be any vector satisfying that for 
, N, a" > if x° ^ 0, and a° = otherwise. In the 
following we show that the global maximum of C is achieved 
at a — a . Let = ||o: || < 00 and denote Cp. > the 

II Moo u 

minimum nonzero a,;, i = 1, • • • , N. Following from the proof 

a 2 1 + AAA T -> and 



of [14 

T 

y 



Theorem 1], we have |C| 



C- X y < 



if a. 



q°. In addition, we have 



II 2 / C 6 

EjLi log («<+'') < Nlog(C 5 +T) and Ef=i"i < NC 5 . 
So we have £ 



-DC 



at a = a°, which completes the proof. 



Theorem [T] shows that the global maximum of the objective 
function is achieved at the maximally sparse solution. Thus the 
proposed approach has no structural errors and the remaining 
question of whether the algorithm can produce this solution 
is a convergence issue. Further, we have the following result 
which is similar to that in fPt] Theorem 2]. 

Theorem 2: Let t > and < e < 1. Every local 
maximum of C is achieved at a sparse a with ||a|| < M 
that leads to a sparse posterior mean fi with \\fJ,\\ < M, 
regardless of the existence of noise. 

Proof: Note that \\fJ,\\ < \\ot\\ since p; L — > as on — > 
0, i = 1, • • • , N. So we need only to show that every local 
maximum of £ is achieved at a sparse a with ||a|| < M. Let 



q (a) = -i log |C| + (e - 1) Eti log («i + r) 



which is convex with respect to a € if e < 1. Suppose 
that a* is a local maximum point of C. We may construct 
a closed, bounded convex polytope V C following from 
fl4| (we omit the details) such that a* e V and if a G V, 
then the second term of C, —\y T C~ x y, equals a constant 

C 7 = -\y T [o 2 l + Adiag (a*) A T ^j y. In addition, all 
extreme points of V are sparse with support size no more 
than M . As a result, a* is a local maximum point of q (a) 
with respect to a 6 V. By the convexity of q (a), a* must 
be an extreme point of V with ||a*|| < M. ■ 
Theorem |2] states that all local maxima of C are achieved at 
sparse solutions. Since we can locally maximize C efficiently 
in practice, based on Theorem [2] we introduce a fast algorithm 
in Section IV that searches for a sparse solution that locally 



maximizes C. 



IV. Fast Algorithm 



A. Fast Greedy Algorithm 

Based on Theorem [2] in Section III-C the following algo- 
rithm aims to find a sparse solution that locally maximizes C 
We consider the contribution of a single basis vector a 3 (the 
jth column of A), j = 1, • • • , N, and determine whether it 
should be included in the model (or in the active set) for the 
maximization of C Denote C-j = a 2 I + 2~2i^j a i a i a f = 
C — ctjdjaj that is independent of the jth basis vector 



6 



aj. Then we have 

C- 1 = CZ) - f«7 ] 



\C\ = 
+ a?C 



\C- 



1 



-V 



and 



'Cz)a s a!jCz). Using 
the two identities above, we rewrite the log-likelihood function 
(with respect to a) into 

£(a) = —logical -iy 



T CZ)y 



+ (e-l) 



i=l 



1 > log (ai + r) — ?y > n, 



N 



i 



log 1 



2 

£(a-i) 



ctjdjc_ 



2(c 



C 3 



C 2 



(14) 



where £(a_j) denotes £ (a) after removing the contribu- 
tion of the jth basis vector, ^(ay) = — | log |1 + a.jSj\ + 



+ (e-l)log| 



l) — rjctj with Sj 



ajCl 



3 3 



1j 

and qj = aj C_jy, and C2, C3 are constants independent 
of a. Note that £ (ai) has been modified by a constant such 
that i (0) = 0. In the following we compute the maximum 
point, say a*, of £ (aj) on [0, +00). If a* > 0, then the basis 
vector aj should be preserved in the active set since it gives 
positive contribution to the likelihood. Otherwise, it should be 
removed from the active set (by setting aj = 0). We consider 
only the general case r>0, 0<e<l and 77 > since it is 
simple for other cases. First we have 

-,- 



d£( aj ) 



11 



1 



2(l + a jSj ) 2(l + a jSj y 



17 



h(a 3 ) 



2 (ay 



(1 



where h (aj 
of aj with 



Ciaf + c 2 a 2 



c 3 a. 



(15) 

C4 is a cubic function 



2 V sj, 

(3 - 2e) s) + 4r) Sj + 2r)TS 2 
(5 - 4e) Sj + 2ry - q 2 - 



ci 

C2 
C3 

c 4 = 2 - 2e 



2?7 - <£ 



To compute the maximum point of £(oy) 
following result. 

Lemma 2 ( [34]): For a cubic function g (t) 



we 



(16) 
(17) 
I, (18) 
(19) 

need the 



\ 2 t 2 + A 3 i + A4, let the discriminant A = I8A1A2A3A4 



4A|A 4 + A|A§ 



4AiA 3 



:! - 27A?Ai If A > 0, then g(t) 







has three distinct real roots. If A — 0, then the equation has 
three real roots including a multiple root. If A < 0, then the 
equation has one real root and two complex conjugate roots. 

Note that Sj > for j = 1, • • • , iV since C-j and Cz) 
are positive definite, and then ci,C2 > 0. We divide our 
discussions into three scenarios. 

1) C4 < 0: This case is the same as the update of a in 
By Lemma [I] h(aj) = has a unique 



Subsection 



III-B 



solution a* on (0, +00). Then it is easy to show that I (aj) in- 
creases monotonically on (0, a*\ and decreases monotonically 
) . Thus £ (aj ) obtains the maximum at a* > 0. 



2) a > 0, c 3 < and A > 0: Let A denote the 
determinant of h (aj). We see that h (aj) — has three real 
distinct roots by Lemma [2] since A > 0. The two stationary 
points of h (aj ) lie on different sides of the y-axis since 
d ^ J - > = 3cia 2 + 2c 2 ay +c 3 with c 3 < 0. Thus the three roots 
include one negative root and two distinct positive roots since 
h(0) — C4 > 0. Denote a'j > the largest root. We see that 
£(otj) decreases from aj = to some point, then increases 
until aj = a'j and decreases again. As a result, t(aj) obtains 
the maximum at or a'j. So we have 

• if £ [a'j) > 0, then the maximum point is a* = a'j > 0; 

• if I (a'j) < 0, then a* = 0; 

3) C4 > 0, and c 3 > or A < 0: In this scenario, the 
maximum of £(aj) is obtained at a* = 0. We divide our 
discussions into three cases: 7JA<0, 2JA = 0, and 3) A > 
and c 3 > 0. In Case 1, h (aj) — has only one (negative) 
real root. In Case 2, the equation has three negative roots (two 
of them coincide), or one negative root and a multiple positive 
root. In Case 3 the equation has three distinct negative roots. 
So in all the cases £' (aj ) < for aj > 0, resulting in that 
£(aj) decreases monotonically on [0, +00). 

Based on the analysis above, we can compute efficiently 
a* (given Sj and qj) at which the likelihood is maximized 
with respect to a single basis vector aj, j — 1, ■ ■ • ,N. So, 
the likelihood consistently increases if we update a single 
aj at one time with the basis aj, j — l, -- ,N, properly 
chosen. As a result, a greedy algorithm can be implemented. 
At the beginning, no basis vectors are included in the model 
(i.e., the active set is empty or all aj — 0). Then choose 
a vector aj at each iteration such that it gives the largest 
likelihood increment by updating aj from its current value 
a'j to the maximum point a*. If a° = and a* > 0, then 
the basis vector aj is added to the model with aj — a*. If 
a^ > and a* > 0, then aj is re-estimated in the model. 
If a? > and a* = 0, then a„ is removed from the model 



with aj = 0. After that, update rj as in Subsection III-B The 



process is repeated until convergence (that is guaranteed since 



£ increases monotonically). Based on the results of |26], we 



see that /x, S, Sj and qj, j — 1, • • • , N, can be efficiently 
updated without matrix inversions. Consequently, the greedy 
algorithm is computationally efficient. 

Remark 2: The main differences between the proposed al- 
gorithm and those in pT) , p6| are the updates of a and r\. 
Since roots of a cubic equation have explicit expressions and 
the update of 77 hardly depends on the problem dimension, 
the proposed greedy algorithm has the same computational 
complexity as those in |21 1, p6| at each iteration. 



B. Analysis of Basis Selection Condition 

We study the basis selection condition of the fast algorithm 
in more details in this subsection. Based on the analysis in 



Subsection IV-A we have the following result. 

Proposition 1: Suppose that the log-likelihood £ has 
been locally maximized in the fast algorithm. If aj > 



for 



some 
min { Si -| 



basis vector 

2-2e 



a; 



J 



3 ' 



2ry + (5 - 4e) Sj + 2ry + r (A-qs 

where Sj, qj are as defined in Subsection |IV-A| 



then 



> 



7 



> S 
o2 



Proof: Note that the inequalities q 
and q 2 > (5 - 4e) s 3 + 2?/ + r (4?ySj 
to C4 < and c 3 < respectively. Let us suppose that the 
conclusion does not hold. Then we have C4 > and C3 > 0. 



are equivalent 



It follows from the analysis in Subsection IV-A3 that the 
likelihood increases if aj is removed from the model, leading 
to contradiction. ■ 
Remark 3: Proposition [T]provides a necessary condition for 
the basis vectors in the final active set. Note that this con- 
dition is generally insufficient since additional requirements 
are needed in the case of C4 > (e.g., A > 0) according 



2 ^ 
qi > a 



to the analysis in Subsection IV-A In a special case where 
2fj -f 2^2e jjolds, we can conclude that a,j is in the 



model according to Subsection IV-A1 



The basis selection condition concerns the sparsity level of 
the solution of the fast algorithm. We have illustrated that 
different settings of r and e lead to different sparsity-inducing 
priors in Section [II] In the following we will see how the 
parameters affect the sparsity of the algorithm solution. We 
discuss the effects of r and e separately. 

Let us first fix r > and vary e e [0, 1]. It is easy to 



show that as e decreases both the terms sj 



2rj 



2-2e 



and 



(5 — 4e) Sj + 2r/ + T (4r)Sj + s|j increase. Thus the necessary 
condition in Proposition [T] becomes stronger. As a result, the 
solution of the greedy algorithm will be sparser, which is 
consistent with the fact that a smaller e leads to a more 
sparsity-inducing prior as shown in Section [TT] In the case of 



e = 1, the inequality turns to be q 2 > 



Sj + 



2rj that coincides 



with the result in pT) . So, the proposed algorithm will produce 
a sparser solution than that of [21] by simply setting e < 1. 

It is not obvious for the case of fixed e < 1 and varying 
r (the case e = 1 has been discussed before) since, as 
r decreases, the first term Sj + 2rj + increases while 
the second term (5 — 4e) Sj + 2i] + t (irjSj + s|) decreases. 
To make a correct conclusion, we observe that i(af) in 
Subsection |IV-A| is a strictly increasing function with respect 
to t > for any fixed aj > and keeps equal to zero at 
OLj = 0. As a result, as r decreases, it is less likely that the 
maximum of £(af) is achived at a positive point, resulting 
in that the solution of the greedy algorithm gets sparser. This 
is consistent with that a smaller r leads to a more sparsity- 
inducing prior as shown in Section [II] In fact, the greedy 
algorithm produces a zero solution if r = and e < 1 since in 
such a case the maximum of the likelihood is always achieved 
at the origin with respect to every basis vector. So intuitively, 
a smaller r should be used to obtain a more sparsity-inducing 
prior but too small r may lead to inaccuracy for the greedy 
algorithm. Numerical simulations in Section [V] will illustrate 
that the recommended r in Subsection II-B| is a good choice. 

Remark 4: In the case of r = 0, the proposed G-STG 
prior coincides with the Gaussian-gamma prior in [22|. The 
greedy algorithm using this prior developed in \22 \ is claimed 
to follow from the same framework in (26) and maximize 
the likelihood sequentially. However, it should be noted that 
the algorithm in p2| does not really maximize the likelihood 
since, if it does, then it should produce a zero solution as 



discussed above. Specifically, the authors of [22] compute only 
a local maximum point of I (aj ) which cannot guarantee to 



increase the likelihood C while the global maximum of £ (aj) 
is always obtained at the origin. Hence the algorithm in J22J is 
technically incorrect because of the inappropriate basis update 
scheme. Moreover, the algorithm in p2[ has not been shown 
to provide guaranteed convergence. 

V. Numerical Simulations 

In this section, we present numerical results to illustrate 
the performance of the proposed method (we consider only 



the fast algorithm in Subsection IV-Ai. We consider both 
one-dimensional synthetic signals and two-dimensional im- 
ages, and compare with existing methods, including £\ op- 
timization (BP or BPDN), reweighted (RW-) l\ optimiza- 
tion [9 1, StOMP (TT), the basic BCS (26) and BCS with 
the Laplace prior (denoted by Laplace) pT) . BCS, Laplace 
and the proposed method are SBL methods, li optimiza- 
tion is a convex optimization method. RW-£i is related to 
nonconvex optimization. StOMP is a greedy method. The 
Matlab codes of BP, StOMP and BCS are obtained from 
the SparseLab packag^j] and that of BPDN is from the 
fi-magic packag^] The code of Laplace is available at 
https://netfiles.uiuc.edu/dbabacan/www/links.html. The num- 
ber of iterations is set to 5 for RW-£i (i.e., 5 t\ minimization 
problems are solved iteratively). To make a fair comparison, 
we use the same convergence criterion in the proposed method 
as in BCS and Laplace with the stopping tolerance set to 
1 x 1CT 8 . 

The performance metrics adopted include the relative mean 
squared error (RMSE, calculated by 1 1 SB — a;|| 2 / ||a;|| 2 ), the 
support size (||x|L), the number of iterations (for the three 
BCS methods) and the CPU time, where x and x denote the 
recovered and original signals, respectively. 



A. One-Dimensional Synthetic Signals 

1) Performance with respect to t: An explicit determina- 
tion of t has been recommended in Subsection IH-BI In the 
following, we show that, indeed, this setting leads to good 
performance in the signal recovery. In our simulation, we 
set the signal length N = 512 and the number of nonzero 
entries K — 20, and vary the sample size M from 40 to 
140 with step size of 5. The nonzero entries of the sparse 
signal are randomly located with the amplitudes following 
from a zero-mean unit-variance Gaussian distribution. We 
consider two matrix ensembles for the sensing matrix A, 
including Gaussian ensemble and uniform spherical ensemble 
(with columns uniformly distributed on the sphere S^ -1 ). 
To obtain the desired SNR, zero-mean AWGNs are added to 
the linear measurements where the noise variance is set to 
a 2 = (K/M) io~ SNR / 10 . We set SNR = 25 dB. The noisy 
measurements are used in the following signal recovery pro- 
cess. In the proposed algorithm, we set e = 0.01 which results 
in both fast and accurate recovery (this will be illustrated in the 
next experiment). Denote tq = (M/N)a 2 . We set r = 9tq 
and consider six values of 6 = 10~ 4 , 10~ 2 , 10 _1 , 1, 10 



'Available at http://sparselab.stanford.edu 



2 Available at http://users.ece.gatech.edu/~justin/llmag: 
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and 100. Thus — I leads to the recommended value of r 



in Subsection II-B For each M, 100 random problems are 
generated and solved respectively using the proposed method 
with different t. The metrics are averaged results over the 100 
trials. 

Our simulation results are presented in Fig. [2] Fig. 2(a) and 
Fig. 2(b) plot the RMSEs and support sizes of the proposed 
algorithm with Gaussian sensing matrices. It is shown that 
the recommended t leads to approximately the smallest error 
with a reasonable number of measurements while the errors are 
almost the same when the sample size is small for different 
Fig. 2(b) shows that the recommended r results in the 



T'S. 



most accurate estimation of the support size in most cases. 
In addition, it is shown that a sparser solution is obtained 
if a smaller r is used in the algorithm as expected. Almost 



identical performance is shown in Fig. 2(c) and 2(d) by using 
the uniform spherical ensemble. Thus, we consider only the 
uniform spherical ensemble in the following experiments. 

2) Performance with respect to e: We study now the 
performance of the proposed algorithm with respect to e. We 
repeat the simulation above using the recommended r and 
consider five values of e = 0, 0.01, 0.1, 0.5 and 1. Note that 
the case e = 1 corresponds to the Laplace prior. Our simulation 
results are presented in Fig. [3] It is shown in Fig. 3(a) that 
the signal recovery error decays as the sample size increases 
in general. As the sample size is small the estimation errors 
differ slightly. But with a reasonable number of measurements 



a smaller e results in a smaller error. It is shown in Fig. 3(b) 



that a smaller e leads to a sparser solution as expected and 
more accurate support size estimation. Another advantage of 



adopting a small e can be observed in Fig. 3(c) where it is 
shown that a smaller e leads to less number of iterations. In 
general, the time consumption is proportional to the number of 
iterations since the computational workload is approximately 
the same at each iteration. Fig. |3(d) shows an exception at 
e = as illustrated in Remark [T] In this case, the update of ij 
takes most of the computational time in our simulation. Since 



it is shown in Figs. 3(a) - 3(c) that the performance at e = 



and 0.01 is hardly distinguishable, we use e = 0.01 in the rest 
simulations. 

3) Comparison with existing methods: We consider two 
simulation setups. In the first case we repeat the simulations 
above (i.e., we fix the SNR = 25 dB and vary M). In each 
trial, all methods share the same data. We adopt the FAR 
thresholding strategy in StOMR Our simulation results are 
presented in Fig. [4] It is shown in Fig. |4(a)| that the recon- 
struction errors of the three SBL methods (BCS, Laplace and 
our proposed method) are very close to each other and larger 
than those of BPDN, RW-BPDN and StOMP if the sample 
size is small. With a reasonable sample size it can be seen that 
our proposed method has the smallest error. Fig. |4(b)| shows 
the average support size of the recovered signal. The results 
of BPDN and RW-BPDN are omitted since they are global 
optimization methods and their numerical solutions have no 
exact zero entries. In general, the estimated support sizes of 
StOMP, BCS and Laplace increase with the sample size. As 
expected the proposed method produces sparser solutions than 
BCS and Laplace. It is shown that the proposed method can 
accurately estimate the support of the sparse signal in most 



cases and has the best performance. Fig. 4(c) plots the number 
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of iterations of the three SBL methods, where it is shown that 
the proposed one uses the least number of iterations and thus 
is the fastest one in computational speed. On average, StOMP 
uses the least computation time (about 0.01s), followed by the 
SBL methods (from 0.06 to 0.1s), and then BPDN (about Is) 
and RW-BPDN (about 2s). We note that a number of solvers 
have been proposed to solve the BPDN problem with improved 
speed, e.g., SPGL1 (35). 

In the next simulation we set the sample size M = 120 
and vary the SNR from to 50dB with step size of 5dB. 
The simulation results are presented in Fig. [5] Fig. 5(a) shows 
that the proposed method has consistently the smallest signal 



TABLE I 

Averaged Relative MSEs, CPU Times and Number of Nonzero 
Entries (mean ± standard deviation) for Multiscale CS 
Reconstruction of the Mondrian Image. 



recovery error. Fig. 5(b) shows that the proposed method 
produces the sparsest solution and the most accurate support 



size estimation. Fig. 5(c) shows that among the three SBL 
methods the proposed one uses the least number of iterations at 
all SNR levels. In the low SNR regime, it can be 6 and 3 times 
less in comparison with Laplace and BCS respectively, leading 
to that the proposed method is much faster than Laplace and 
BCS. 

In summary, the proposed method has improved perfor- 
mance for sparse signals in comparison with existing ones. 
It outperforms its SBL peers in both signal recovery accuracy 
and computational speed. 

B. Images 

In this section, we revisit the widely used multiscale CS 
reconstruction [36] of the 512 x 512 Mondrian image in 
SparseLab. We use the same simulation setup, i.e., we choose 
the "symmlet8" wavelet as the sparsifying basis with a coarsest 
scale jo — 4 and a finest scale ji — 6. The number of 
wavelet samples is N = 4096 and the sample size of CS 
methods is M = 2713. The parameters of BP and StOMP 
with the FDR and FAR thresholding strategies (denoted by 
FDR and FAR respectively) are set as in SparseLab. Since the 
wavelet expansion of the Mondrian image is compressible but 
not exactly sparse, we set a 2 — O.Olt^ar (y) in Laplace and 
our proposed method as in BCS, where Var (y) denotes the 
variance of the entries of y. 

Table [I] presents the experimental results over 100 trials. 
Linear reconstruction from 4096 wavelet samples has a recon- 
struction error of 0.1333 that represents a lower bound of the 
error of the considered CS methods. The global optimization 
method BP has the smallest error among the CS methods, 
followed by BCS, Laplace, the proposed method and StOMP. 
The presented results verify again that the proposed method 
produces a sparser solution than BCS and Laplace. In fact, it 
produces the sparsest solution among all the methods. So it 
is reasonable that the proposed method has a slightly worse 
reconstruction error in comparison with BCS and Laplace 
since the original signal is not exactly sparse. We note that 
the proposed method is faster than BCS and Laplace. FDR 
uses the least time but has the worst accuracy. In comparison 
with FAR, the proposed method is slightly slower but more 
accurate. Finally, it can be observed that the proposed method 
has the most stable performance among the CS methods except 
BP by comparing the standard deviation of the metrics. We 





RMSE 


Time (s) 


# Nonzeros 


Linear 


0.1333 




4096 


BP 


0.1393 ±0.0008 


42.2 ±4.03 


4096 ± 


FDR 


0.1999 ±0.0487 


8.84 ±2.12 


2155 ± 122 


FAR 


0.1499 ±0.0033 


17.0 ±4.35 


1142 ±41 


BCS 


0.1423 ±0.0023 


27.2 ± 5.92 


1305 ± 67 


Laplace 


0.1429 ±0.0013 


25.7 ±6.11 


1218 ±65 


Proposed 


0.1448 ±0.0011 


21.3 ±4.09 


1049 ± 21 



note that BP can be accelerated using recently developed 
algorithms for l\ optimization. Fig. [6] shows examples of 
reconstructed images where faithful reconstructions of the 
Mondrian image can be observed. 

VI. Conclusion 

The sparse signal recovery problem in CS was studied in 
this paper. Within the framework of Bayesian CS, a new hi- 
erarchical sparsity-inducing prior was introduced and efficient 
signal recovery algorithms were developed. Similar theoretical 
results on the global and local optimizations of the proposed 
method were proven as that for the basic SBL. The main 
algorithm was shown to produce sparser solutions than its 
existing SBL peers. Numerical simulations were carried out 
to demonstrate the improved performance of the proposed 
sparsity-inducing prior and solution. The proposed G-STG 
prior preserves the general structure of existing hierarchical 
sparsity-inducing priors and can be implemented in other SBL- 
based methods with ease. 
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Fig. 6. The 512 X 512 Mondrian image (a) and its reconstructions using (b) linear reconstruction (RMSE = 0.1333) from N = 4096 wavelet samples 
and a multiscale CS scheme from M = 2713 linear measurements by (c) BP (RMSE = 0.1391, time = 44.7s and # nonzeros = 4096), (d) StOMP 
with FDR thresholding (RMSE = 0.1751, time = 10.3s and # nonzeros = 2014), (e) StOMP with FAR thresholding (RMSE = 0.1529, time = 15.7s 
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